Runbook: Loki Ingestion Failure
Alert
- Prometheus Alerts: LokiIngestionRateDrop, LokiRequestErrors, LokiStreamLimitExceeded
- Grafana Dashboard: Loki dashboard
- Firing condition: Loki ingestion rate drops by more than 50% compared to the previous hour, or Loki returns 429/500 errors to Alloy collectors
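The firing condition above can be expressed as a Prometheus alerting rule. The following is an illustrative sketch only -- the group name, `for` duration, and label values are assumptions, not the cluster's actual rule definition:

```yaml
# Sketch of a PrometheusRule body matching the firing condition.
# Compares the current 5m ingestion rate against the same window one hour ago.
groups:
  - name: loki-ingestion
    rules:
      - alert: LokiIngestionRateDrop
        expr: |
          sum(rate(loki_distributor_bytes_received_total[5m]))
            < 0.5 * sum(rate(loki_distributor_bytes_received_total[5m] offset 1h))
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Loki ingestion rate dropped by more than 50% vs. the previous hour"
```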
Severity
Warning -- Log ingestion failure means platform and application logs are being dropped. This creates gaps in observability and audit trails, violating NIST AU-2 (Audit Events) and AU-4 (Audit Storage Capacity) controls.
Impact
- New logs from pods and node journals are not being stored
- Grafana log queries return incomplete results
- Audit trail gaps for compliance (NIST AU-2, AU-12)
- Alert rules based on log queries (LogQL) will not fire
- Incident investigation is impaired due to missing log data
Investigation Steps
- Check Loki pod status:
kubectl get pods -n logging -l app.kubernetes.io/name=loki
- Check Loki logs for errors:
kubectl logs -n logging -l app.kubernetes.io/name=loki --tail=200
- Check Alloy (log collector) pod status:
kubectl get pods -n logging -l app.kubernetes.io/name=alloy
- Check Alloy logs for delivery errors:
kubectl logs -n logging -l app.kubernetes.io/name=alloy --tail=200 | grep -i "error\|429\|500\|dropped\|failed"
- Check Loki metrics for ingestion rate:
Prometheus query: rate(loki_distributor_bytes_received_total[5m])
- Check for rate limiting:
kubectl logs -n logging -l app.kubernetes.io/name=loki --tail=200 | grep -i "rate\|limit\|429"
- Check Loki storage (filesystem in SingleBinary mode):
kubectl exec -n logging $(kubectl get pod -n logging -l app.kubernetes.io/name=loki -o name | head -1) -- df -h /tmp/loki
- Check the HelmRelease status:
flux get helmrelease loki -n logging
flux get helmrelease alloy -n logging
- Check if Loki can be reached from Alloy:
kubectl exec -n logging $(kubectl get pod -n logging -l app.kubernetes.io/name=alloy -o name | head -1) -- wget -q -O- http://loki.logging.svc.cluster.local:3100/ready
- Check Loki readiness:
kubectl exec -n logging $(kubectl get pod -n logging -l app.kubernetes.io/name=loki -o name | head -1) -- wget -q -O- http://localhost:3100/ready
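The checks above can be bundled into a small triage helper. A sketch, assuming kubectl is already pointed at the affected cluster; the function names are ours, not part of any existing tooling:

```shell
#!/usr/bin/env bash
# Sketch: one-shot triage for the investigation steps above.
# Namespace and label selectors mirror the runbook commands.
set -uo pipefail

NS=logging
LOKI_SEL="app.kubernetes.io/name=loki"
ALLOY_SEL="app.kubernetes.io/name=alloy"

pod_status() {      # pod status for a label selector
  kubectl get pods -n "$NS" -l "$1" -o wide
}

recent_errors() {   # recent error-like log lines for a selector
  kubectl logs -n "$NS" -l "$1" --tail=200 \
    | grep -iE "error|429|500|dropped|failed" || true
}

loki_ready() {      # hit /ready from inside the first Loki pod
  local pod
  pod=$(kubectl get pod -n "$NS" -l "$LOKI_SEL" -o name | head -1)
  kubectl exec -n "$NS" "$pod" -- wget -q -O- http://localhost:3100/ready
}

# Only touch the cluster when explicitly asked: ./triage.sh run
if [ "${1:-}" = "run" ]; then
  pod_status "$LOKI_SEL"
  pod_status "$ALLOY_SEL"
  recent_errors "$ALLOY_SEL"
  loki_ready
fi
```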
Resolution
Loki storage full
- Check current disk usage:
kubectl exec -n logging $(kubectl get pod -n logging -l app.kubernetes.io/name=loki -o name | head -1) -- du -sh /tmp/loki/chunks /tmp/loki/rules
- If the storage is full, reduce the retention period temporarily:
# In the Loki HelmRelease values
loki:
  limits_config:
    retention_period: "168h" # Reduce from 720h to 168h (7 days)
- Push the change to Git and reconcile:
flux reconcile helmrelease loki -n logging --with-source
- If using PVC storage, expand the PVC (if the StorageClass supports expansion):
kubectl edit pvc storage-loki-0 -n logging
- For immediate relief, restart Loki to trigger compaction:
kubectl rollout restart statefulset loki -n logging
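To decide quickly whether the volume has crossed the critical threshold, the `df` output can be parsed directly. A sketch using `df -P` (POSIX format keeps the columns stable); the 85% threshold matches the critical alert level in the Prevention section:

```shell
# Sketch: extract the Use% figure from `df -P <path>` output on stdin
# and compare it against the critical threshold.
disk_use_pct() {
  awk 'NR==2 { gsub("%","",$5); print $5 }'
}

storage_critical() {  # exit 0 if usage >= 85%
  [ "$1" -ge 85 ]
}
```

Usage: `kubectl exec -n logging <loki-pod> -- df -P /tmp/loki | disk_use_pct`, then feed the number to `storage_critical` to gate the retention-reduction step.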
Loki rate limiting (429 errors)
- Check the current rate limits:
kubectl get helmrelease loki -n logging -o yaml | grep -A 10 "limits_config"
- Increase the ingestion rate limits:
loki:
  limits_config:
    ingestion_rate_mb: 10
    ingestion_burst_size_mb: 20
    per_stream_rate_limit: "5MB"
    per_stream_rate_limit_burst: "15MB"
- Alternatively, reduce log volume by filtering noisy namespaces in the Alloy configuration
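As an illustration of the filtering alternative, a hedged sketch of an Alloy `loki.process` component that drops logs from a noisy namespace. The component names and the `dev-sandbox` namespace are placeholders; verify the stage syntax against the Alloy documentation for the deployed version:

```river
// Sketch only: drop all log lines from the (hypothetical) dev-sandbox
// namespace before forwarding to Loki.
loki.process "drop_noise" {
  forward_to = [loki.write.default.receiver]

  stage.match {
    selector = "{namespace=\"dev-sandbox\"}"
    action   = "drop"
  }
}
```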
Alloy pods not collecting logs
- Check Alloy DaemonSet:
kubectl get daemonset alloy -n logging
- Verify all nodes have an Alloy pod:
kubectl get pods -n logging -l app.kubernetes.io/name=alloy -o wide
- If a pod is missing, check node taints:
kubectl describe node <node-name> | grep Taints
- Restart the Alloy DaemonSet:
kubectl rollout restart daemonset alloy -n logging
Alloy cannot reach Loki
- Test connectivity:
kubectl exec -n logging $(kubectl get pod -n logging -l app.kubernetes.io/name=alloy -o name | head -1) -- wget -q -O- http://loki.logging.svc.cluster.local:3100/ready
- Check NetworkPolicies in the logging namespace:
kubectl get networkpolicies -n logging
- Verify the Loki service exists and has endpoints:
kubectl get svc loki -n logging
kubectl get endpoints loki -n logging
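If a NetworkPolicy is blocking the collectors, an explicit allow rule for Alloy-to-Loki traffic can be added. A sketch only -- the label selectors mirror the ones used elsewhere in this runbook, but verify them against the actual chart-generated labels before applying:

```yaml
# Sketch: admit ingress from Alloy pods to Loki's HTTP port (3100).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-alloy-to-loki
  namespace: logging
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: loki
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: alloy
      ports:
        - protocol: TCP
          port: 3100
```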
Loki pod crash-looping
- Check logs from the previous crash:
kubectl logs -n logging -l app.kubernetes.io/name=loki --previous --tail=100
- Common causes:
- Out of memory (check resource limits)
- Corrupted WAL (write-ahead log)
- Schema migration failure after upgrade
- If WAL corruption is suspected:
# Delete the WAL directory (data since last flush will be lost)
kubectl exec -n logging $(kubectl get pod -n logging -l app.kubernetes.io/name=loki -o name | head -1) -- rm -rf /tmp/loki/wal
kubectl rollout restart statefulset loki -n logging
- If out of memory, increase limits in the HelmRelease
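For the out-of-memory case, a sketch of the HelmRelease values change. The exact key path depends on the chart deployment mode (`singleBinary.resources` shown here, matching the SingleBinary mode referenced above), and the sizes are placeholders:

```yaml
# Sketch: raising memory for Loki in SingleBinary mode.
# 1Gi/2Gi are placeholders -- size from observed usage, not from here.
singleBinary:
  resources:
    requests:
      memory: 1Gi
    limits:
      memory: 2Gi
```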
High cardinality labels causing memory pressure
- Check for high cardinality (the active stream count is a good proxy):
Prometheus query: sum(loki_ingester_memory_streams)
- Identify namespaces producing the most log volume:
LogQL: sum by (namespace) (bytes_rate({namespace=~".+"}[5m]))
- Add label drop rules in the Alloy configuration for high-cardinality labels that are not useful
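A hedged sketch of such a drop rule in Alloy, using `loki.relabel`. The `pod_template_hash` label is only an example of a typical high-cardinality offender -- identify the real ones with the queries above first, and check the rule syntax against the Alloy documentation:

```river
// Sketch only: strip an example high-cardinality label before logs
// reach Loki. Component names are illustrative.
loki.relabel "strip_high_cardinality" {
  forward_to = [loki.write.default.receiver]

  rule {
    action = "labeldrop"
    regex  = "pod_template_hash"
  }
}
```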
Emergency: log data gap recovery
If logs were lost during the outage:
- Alloy pods do not buffer logs to disk by default, so logs generated during the outage are lost
- For critical audit events, check the Kubernetes API audit log on the server node:
ssh sre-admin@<server-ip> "sudo cat /var/lib/rancher/rke2/server/logs/audit.log | tail -500"
- Check node journals for events during the gap:
ssh sre-admin@<node-ip> "sudo journalctl --since '<start-time>' --until '<end-time>'"
Prevention
- Monitor the loki_ingester_chunks_stored_total and loki_distributor_bytes_received_total metrics
- Alert on ingestion rate drops greater than 50% over 15 minutes
- Alert on Loki storage usage at 70% (warning) and 85% (critical)
- Set appropriate retention periods based on compliance requirements (currently 720h / 30 days)
- Review and optimize Alloy relabeling rules to drop unnecessary labels
- Ensure Loki has sufficient memory for the expected log volume
- Configure Alloy to drop debug-level logs from noisy namespaces in non-production environments
Escalation
- If log ingestion is completely stopped: this is a compliance violation (NIST AU-2, AU-12) -- escalate to platform team lead
- If Loki data is corrupted and unrecoverable: restore from Velero backup of the logging namespace
- If the issue is caused by excessive log volume from a tenant application: notify the tenant team to reduce log verbosity